PaperLuc 1 Semantic Foggy Scene Understanding with Synthetic Data

March 02, 2019

The fog is thick.
Research is not easy.
# Main Work

This work addresses the problem of semantic foggy scene understanding (SFSU). Although extensive research has been done on image dehazing and on semantic scene understanding with clear-weather images, little attention has been paid to SFSU. Due to the difficulty of collecting and annotating foggy images, the authors choose to generate synthetic fog on real images that depict clear-weather outdoor scenes, and then leverage these partially synthetic data for SFSU by employing state-of-the-art convolutional neural networks (CNNs). In particular, a complete pipeline is developed to add synthetic fog to real clear-weather images using incomplete depth information. Fog synthesis is applied to the Cityscapes dataset to generate Foggy Cityscapes with 20550 images. SFSU is tackled in two ways: 1) with typical supervised learning, and 2) with a novel semi-supervised learning approach, which combines 1) with an unsupervised supervision transfer from clear-weather images to their synthetic foggy counterparts. In addition, the effectiveness of image dehazing for SFSU is studied carefully. For evaluation, the paper presents Foggy Driving, a dataset of 101 real-world images depicting foggy driving scenes, which comes with ground-truth annotations for semantic segmentation and object detection. Extensive experiments show that 1) supervised learning with the synthetic data significantly improves the performance of state-of-the-art CNNs for SFSU on Foggy Driving; 2) the semi-supervised learning strategy further improves performance; and 3) image dehazing marginally benefits SFSU with these learning strategies.
The datasets, models, and code are publicly available.

Source: Google Translate

main contributions:

  • an automatic and scalable pipeline to impose high-quality synthetic fog on real clear-weather images;
  • two new datasets, one synthetic and one real, to facilitate training and evaluation of models used in SFSU;
  • a new semi-supervised learning approach for SFSU;
  • a detailed study of the benefit of image dehazing for SFSU and human perception of foggy scenes.

This post mainly focuses on the pipeline for generating Foggy Cityscapes.

Related Work

Key Work

Optical Model of Choice for Fog

The classical optical model of fog is used (for the case of homogeneous atmospheric light):

I(x) = R(x)t(x) + L(1 − t(x))

For a homogeneous medium:

t(x) = exp (−βl(x))

The parameter β is named the attenuation coefficient, and it effectively controls the thickness of the fog: larger values of β mean thicker fog.
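For concreteness, applying this model is a single vectorized expression. A minimal MATLAB sketch, assuming R is an H×W×3 double image, t an H×W transmission map in [0, 1], and L a scalar atmospheric light (the file name is illustrative; implicit expansion requires R2016b or newer):

    % Apply the optical model I(x) = R(x) t(x) + L (1 - t(x)) per pixel.
    R = im2double(imread('aachen_000001_000019_leftImg8bit.png'));
    I = R .* t + L .* (1 - t);  % t expands across the three color channels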

The thickness of fog is also expressed via the meteorological optical range (MOR), also known as visibility, defined as the maximum distance from the camera for which t(x) ≥ 0.05:

exp(−βl(x)) ≥ 0.05

l(x) ≤ −ln(0.05)/β = 2.996/β

MOR = max l(x) = 2.996/β

By definition, fog decreases the MOR to less than 1 km, hence

β ≥ 2.996 × 10⁻³ m⁻¹

where the lower bound corresponds to the lightest fog configuration.

These values are from the final version of the paper; in the original version the threshold was t(x) ≥ 0.02, which gives β ≥ 3.912 × 10⁻³ m⁻¹.
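The two constants come from −ln(0.05) ≈ 2.996 and −ln(0.02) ≈ 3.912. A quick check (illustrative snippet, not from the released code):

    % Lightest fog configuration: MOR at its upper bound of 1 km.
    MOR = 1000;                   % meters
    beta = -log(0.05) / MOR      % 2.9957e-03 per meter (final version)
    beta_old = -log(0.02) / MOR  % 3.9120e-03 per meter (original version)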

The main tasks are therefore:

  • estimation of t, and
  • estimation of L from R.

The second task is simple: we use the method proposed in [dark channel prior] with the improvement of [Investigating haze-relevant features in a learning framework for image dehazing].
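As a rough illustration of what this estimation looks like, here is a dark-channel-style sketch; the 15×15 patch, the 0.1% percentile, and the intensity-based final selection are illustrative assumptions, not the exact released implementation:

    % Estimate atmospheric light L from the clear-weather image R.
    R = im2double(imread('aachen_000001_000019_leftImg8bit.png'));
    % Dark channel: per-pixel minimum over color channels, then a local
    % minimum filter over a 15x15 patch.
    dark = imerode(min(R, [], 3), strel('square', 15));
    % Among the brightest 0.1% of dark-channel pixels, pick the one with the
    % highest gray-level intensity in R as the atmospheric light pixel.
    [~, order] = sort(dark(:), 'descend');
    candidates = order(1:ceil(0.001 * numel(dark)));
    gray = rgb2gray(R);
    [~, best] = max(gray(candidates));
    [row, col] = ind2sub(size(dark), candidates(best));
    L = R(row, col, :);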

Depth Denoising and Completion for Outdoor Scenes

Generating an accurate transmission map t requires the following inputs:

  • the original, clear-weather color image R to add synthetic fog on, which constitutes the left image of a stereo pair,

    aachen_000001_000019_leftImg8bit.png

  • the right image Q of the stereo pair,

    aachen_000001_000019_rightImg8bit.png

  • the intrinsic calibration parameters of the two cameras of the stereo pair as well as the length of the baseline, for instance:

    aachen_000001_000019_camera.json:

    {
        "extrinsic": {
            "baseline": 0.209313,
            "pitch": 0.038,
            "roll": 0.0,
            "x": 1.7,
            "y": 0.1,
            "yaw": -0.0195,
            "z": 1.22
        },
        "intrinsic": {
            "fx": 2262.52,
            "fy": 2265.3017905988554,
            "u0": 1096.98,
            "v0": 513.137
        }
    }
  • a dense, raw disparity estimate D for R of the same resolution as R, and

    aachen_000001_000019_disparity.png

  • a set M comprising the pixels where the value of D is missing.

All of the above requirements can be satisfied with a stereo camera and a standard stereo matching algorithm [Stereo processing by semiglobal matching and mutual information].

pipeline steps:

  1. calculation of a raw depth map d in meters,
  2. denoising and completion of d to produce a refined depth map d’ in meters,
  3. calculation of a scene distance map l in meters from d’,
  4. application of t(x) = exp (−βl(x)) to obtain an initial transmission map ˆt, and
  5. guided filtering [Guided image filtering.] of ˆt using R as guidance to compute the final transmission map t.

The central idea is to leverage the accurate structure that is present in the color images of the stereo pair in order to improve the quality of depth, before using the latter as input for computing transmission.
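Putting the five steps together, the top level of the pipeline might read as follows. This is a sketch: distance_in_meters_cityscapes and transmission_postprocessing_guided_filter appear in the released snippets quoted below, while the other function names and signatures are placeholders.

    % End-to-end fog simulation for one Cityscapes stereo pair (sketch).
    beta = 0.01;  % attenuation coefficient; the paper uses 0.005/0.01/0.02
    % Step 1: raw depth map in meters from the raw disparity input D.
    d = depth_in_meters(D, camera_parameters_file);         % placeholder name
    % Step 2: denoising and completion guided by superpixels of R.
    d_refined = depth_denoising_completion(d, R, Q, D, M);  % placeholder name
    % Step 3: radial scene distance from planar depth.
    l = distance_in_meters_cityscapes(d_refined, camera_parameters_file);
    % Step 4: initial transmission via the Beer-Lambert law.
    t_init = exp(-beta * l);
    % Step 5: edge-preserving refinement with R as guidance.
    t = transmission_postprocessing_guided_filter(t_init, R);
    % Finally, apply the optical model with the estimated atmospheric light L.
    I = R .* t + L .* (1 - t);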

Each step in detail:

Step 1

calculation of a raw depth map in meters from the raw disparity input with:

disparity D / baseline B = focal length f_x / depth z,  i.e.  z = B · f_x / D (similar triangles)

we use the input disparity D in combination with the values of the focal length and the baseline to obtain d.

The missing values for D, indicated by M, are also missing in d.

    % Compute the depth as inversely proportional to disparity. Wherever the
    % disparity is zero, the depth is equal to infinity. Convention: any known
    % invalid disparity is assigned infinite depth.
    depth_map_in_meters = zeros(size(disparity_in_pixels));
    depth_map_in_meters(~is_disparity_invalid & ~is_disparity_zero) = ...
        B * f_x ./ disparity_in_pixels(~is_disparity_invalid & ~is_disparity_zero);
    depth_map_in_meters(is_disparity_zero | is_disparity_invalid) = Inf;

aachen_000001_000019_initdepth.png

Step 2

denoising and completion of the above raw depth map to produce a refined depth map in meters

A superpixel segmentation of the clear image R is used to guide depth denoising and completion at the level of superpixels, under the assumption that each individual superpixel corresponds roughly to a plane in the 3D scene.

detailed steps:

  1. apply a photo-consistency check between R and Q, using the input disparity D to establish pixel correspondences between the two images of the stereo pair,

    consistency check:

    For most image completion methods there is no suitable way to detect visual artifacts, since the algorithm considers its own result optimal. By providing an additional image and inpainting the two images independently and simultaneously, potentially erroneous solutions can be detected automatically via a consistency check (a minimal sketch of such a check is given at the end of Step 2).

    Assuming the surfaces in the scene are close to Lambertian, unreliable inpainting results can be detected based on the color consistency of corresponding pixels.

    % Photo-consistency check to create binary mask |M_L| with unreliable depth.
    epsilon = 12 / 255;
    photoconsistency_outlier_mask = outliers_photoconsistency(left_image, ...
        right_image, left_disparity, epsilon);
    M_L = photoconsistency_outlier_mask | is_depth_input_invalid;
  2. segment R into superpixels with SLIC

    % Segmentation of left image in superpixel-wise segments using SLIC.
    % 1) Set parameters.
    number_of_preferred_segments = 2048;
    m = 10;
    % 2) Perform segmentation. |S_L| is the image containing the segmentation
    % result for the left image. |K| is the number of output segments.
    [S_L, K] = slicmex(left_image_uint8, number_of_preferred_segments, m);
    % Convert segmentation result to double and one-based format for subsequent
    % operations.
    S_L = double(S_L) + 1;
    % Convert input image to L*a*b* colorspace for subsequent color similarity
    % computation.
    ......
    left_image_l = left_image_lab(:, :, 1);
    left_image_a = left_image_lab(:, :, 2);
    left_image_b = left_image_lab(:, :, 3);
Process each qualifying segment:

    These superpixels are classified into reliable and unreliable ones with respect to depth information, based on the number of pixels with missing or invalid depth that they contain.

    %% Segment classification and plane fitting with RANSAC for reliable segments
    minimum_count_known = 20;
    minimum_fraction_known = 0.6;
    ......

    A segment S is judged reliable using ||S − O|| > max(6, λ · ||S||), where O denotes its pixels with missing or invalid depth:

    • if reliable: the “plane fitting for (partly) visible segments” method is used:

      fit a depth plane by running RANSAC on its pixels that have a valid value for depth.

      if count_known >= max(minimum_count_known, ...
              minimum_fraction_known * count_pixels)
          % Segment is adequately visible. Run RANSAC for plane fitting.
          ......
  4. Use the “plane assignment for the remaining segments” method:

    The greedy approach is used subsequently to match unreliable superpixels to reliable ones pairwise and assign the fitted depth planes of the latter to the former.

    % Assignment of unreliable segments to visible ones with greedy matching.
    % Color-similarity scale parameter.
    lambda = m;
    ......

    The exact formulas here differ from those of the method in the cited paper “Stereoscopic Inpainting: Joint Color and Depth Completion from Stereo Images”, so they are not elaborated here.

aachen_000001_000019_afterdepth.png

Step 3

calculation of a scene distance map in meters from the refined depth map

    % Compute scene distance from camera for each pixel.
    l = distance_in_meters_cityscapes(d, camera_parameters_file);

    function distance_map_in_meters = distance_in_meters_cityscapes(...
            depth_map_in_meters, camera_parameters_file)
    % Retrieve relevant intrinsic camera parameters: focal length and optical
    % center, both expressed in pixel coordinates.
    [~, f_x, c_x, c_y] = camera_parameters_cityscapes(camera_parameters_file);

    % Compute medium (i.e. air) thickness in meters from depth. The derivation of
    % the formula in the final lines of code is based on similar triangles.
    [height, width] = size(depth_map_in_meters);
    [X, Y] = meshgrid(1:width, 1:height);
    distance_map_in_meters = depth_map_in_meters .* ...
        sqrt((f_x ^ 2 + (X - c_x) .^ 2 + (Y - c_y) .^ 2) / f_x ^ 2);
    end
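Written out, the conversion implemented above turns the planar depth z(x, y) into the radial distance from the optical center along each pixel's viewing ray:

$$l(x, y) = z(x, y)\,\frac{\sqrt{f_x^{2} + (x - c_x)^{2} + (y - c_y)^{2}}}{f_x}$$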

Step 4

application of the transmittance formula for a homogeneous medium to obtain an initial transmittance map

    % Beer-Lambert law for homogeneous medium.
    t = exp(-beta * l);

Step 5

guided filtering of the initial transmittance map using the original clear-weather image as guidance to compute the final transmittance map, in order to smooth transmission while respecting the boundaries of the clear image R

    % Compute final transmittance map.
    t = transmission_postprocessing_guided_filter(t, left_image);
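transmission_postprocessing_guided_filter belongs to the released code; a minimal stand-in using MATLAB's built-in imguidedfilter (Image Processing Toolbox) could look like this, with the neighborhood size and smoothing degree chosen purely for illustration:

    % Edge-preserving smoothing of the transmission map, guided by the clear
    % image, as a stand-in for the released post-processing function.
    function t_out = transmission_guided_filter_sketch(t_init, left_image)
    guidance = rgb2gray(left_image);
    t_out = imguidedfilter(t_init, guidance, ...
        'NeighborhoodSize', [41 41], 'DegreeOfSmoothing', 1e-2);
    end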

Input Selection for High-Quality Fog Simulation

Two refinement criteria:

  • The first refinement criterion is whether the sky is overcast, ensuring that the light in the input real scene is not strongly directional.
  • The second refinement criterion is whether the pixel that is selected as atmospheric light is labelled as sky, which affords an automatic implementation (see the sketch below).
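Checking the second criterion automatically is straightforward given Cityscapes' ground-truth labels. A minimal sketch, assuming the coordinates (row, col) of the atmospheric-light pixel are already available (e.g. from the estimation sketch earlier); the file name is illustrative, and sky has label ID 23 in the Cityscapes labelIds convention:

    % Accept the image only if the atmospheric-light pixel is labelled as sky.
    SKY_LABEL_ID = 23;  % Cityscapes labelIds: sky
    labels = imread('aachen_000001_000019_gtFine_labelIds.png');
    criterion_2_fulfilled = (labels(row, col) == SKY_LABEL_ID);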

Generation of Foggy Cityscapes

Foggy Cityscapes-coarse

  • first obtain 20000 synthetic foggy images from the larger, coarsely annotated part of the dataset

  • keep all of them, without applying the refinement criteria

    trade the high visual quality of the synthetic images for a very large scale and variability of the synthetic dataset.

  • produce labellings with state-of-the-art semantic segmentation models on the original, clear images and use them to transfer knowledge from clear weather to foggy weather

Foggy Cityscapes-refined

  • originally 2975 training and 500 validation images

  • use the two criteria in conjunction to filter the images, obtaining a refined set of 550 images (498 from the training set and 52 from the validation set) which fulfill both criteria

  • Running fog simulation on this refined set provides us with a moderate-scale collection of high-quality synthetic foggy images.

Supervised Learning with Synthetic Fog

outline:

  1. Fine-tune a model that was originally trained on the clear-weather Cityscapes dataset using the synthetic foggy images of Foggy Cityscapes-refined;
  2. Evaluate the fine-tuned model on Foggy Driving and show that its performance improves over the original clear-weather model. Hence, unless otherwise stated, reported results refer to Foggy Driving.

about generalization:

all models are ultimately evaluated on data from a different domain than that of the data on which they have been fitted, revealing their true generalization potential on previously unseen foggy scenes.

Semantic Segmentation

pipeline:

  1. Use Dilation10 trained on the 2975 Cityscapes training images as the baseline: W/o FT;
  2. Fine-tune Dilation10 on Foggy Cityscapes-refined images preprocessed in four ways (no dehazing, MSCNN, DCP, NLD) to obtain the comparison models: FT;
  3. Evaluate the above models on Foggy Driving preprocessed in the same four ways.

Benefit of Fine-tuning on Synthetic Fog

conclusion:

  1. all fine-tuned models outperform Dilation10 irrespective of the type of dehazing preprocessing that is applied, both for mean IoU over all classes and over frequent classes only.

  2. The best results come from the model without dehazing preprocessing:

    The best-performing fine-tuned model, which we refer to as FT-0.01, involves no dehazing and outperforms Dilation10 significantly.

Comparison of Fog Simulation Approaches

conclusion:

Our method for fog simulation consistently outperforms the two baselines, and the “poly” learning rate policy allows the model to be fine-tuned more effectively than the constant policy.

Increasing Returns at Larger Distance

background:

  1. The effect of fog on the appearance of a scene grows with the distance from the camera;
  2. Ideally, a model dedicated to foggy scenes should be comparatively better at the distant parts of the scene.

procedure:

Since no depth information is available for Foggy Driving, results are evaluated on the validation set of Foggy Cityscapes-refined, and performance is examined separately within different distance ranges according to the distance maps produced during fog synthesis.

conclusion:

  1. FT-0.01 brings a consistent gain in performance across all distance ranges.
  2. FT-0.01 is able to better handle the most challenging parts of a foggy scene.

Note that most pixels in the very last distance range (more than 400m away from the camera) belong to the sky class and their appearance does not change much between the clear and the synthetic foggy images.

Generalization in Synthetic Fog across Densities

  1. While the performance of Dilation10 drops rapidly as β increases, all five fine-tuned “foggy” models are more robust to changes in β across the examined range.
  2. Performance is high and fairly stable in the range [0, β(t)] and drops for β > β(t), where β(t) denotes the fog density used for fine-tuning. This implies that a “foggy” model is able to generalize well to lighter synthetic fog than what was used to fine-tune it.
  3. all “foggy” models compare favorably to Dilation10 across the largest part of the range of β, with most “foggy” models being beaten by Dilation10 only for clear weather. Note also that the performance gain with “foggy” models under foggy conditions is much larger than the corresponding performance loss for clear weather.

Effect of Synthetic Fog Density on Real-world Performance

without dehazing:

the models with β = 0.005 and β = 0.01 perform significantly better than the one with β = 0.02, implying (per point 1 of the background above) that Foggy Driving is dominated by scenes with light or medium fog.

with dehazing:

Each dehazing method used for preprocessing has its own particularities in enhancing the appearance and contrast of foggy scenes while also introducing artifacts to the output.

for MSCNN:

MSCNN is relatively conservative in its dehazing and performs best on lightly foggy images. This explains why the model fine-tuned with MSCNN preprocessing under light fog performs best on Foggy Driving, which (as seen in the no-dehazing case) is dominated by light fog.

for DCP:

DCP is the most aggressive in dehazing and performs best on dense fog, as its estimated transmission is biased towards lower values. This also explains why the fine-tuned model with DCP preprocessing works better under medium fog than under light fog: a balance between dehazing and minimizing the introduced artifacts.

for NLD: it estimates atmospheric light with a model different from the one shared by the fog simulation, MSCNN and DCP; see point 3 of the discussion below.

Effect of Dehazing Preprocessing on Real-world Performance and Discussion

Why does no dehazing work best, with only MSCNN occasionally giving better results?

The lack of significant performance gains on Foggy Driving with dehazing preprocessing can be attributed to both generic and method-specific reasons.

  1. The optical model for homogeneous atmospheric conditions that all dehazing methods rely on may not hold in Foggy Driving. Although the same model is used for fog synthesis, synthesizing fog is a forward problem while dehazing is an inverse problem, so the artifacts introduced by fog synthesis are less prominent than those introduced by dehazing. An interesting viewpoint is that generating training data for a difficult target domain from source-domain data with a forward technique can be an alternative to transforming the target domain into an easier source domain with an inverse technique;
  2. Second, the optical model that most popular dehazing methods rely on assumes a linear relationship between the irradiance at a pixel and the actual value of that pixel in the processed foggy image. These methods therefore require an initial gamma correction step before dehazing, otherwise their performance may deteriorate significantly. This in turn means that the value of gamma must be known for each image, which is not the case for Cityscapes and Foggy Driving. A manual search for the “best” gamma value is also infeasible for such large datasets. In the absence of further information, a constant value of 1 was used for gamma as recommended by the authors of [6], which is probably suboptimal for most images. Future work on outdoor datasets, whether or not fog/haze is considered, should thus ideally record the gamma value for each image, so that dehazing methods can show their full potential on these datasets;
  3. For DCP in particular, the drop in performance compared to MSCNN is partly due to the mismatch between the light-fog character of Foggy Driving and DCP's optimal operating point. NLD, on the other hand, estimates atmospheric light with a model different from the one shared by our fog simulation, MSCNN and DCP, and therefore faces greater difficulty when dehazing Foggy Cityscapes.

Linking the Objective and Subjective Utility of Dehazing Preprocessing in Foggy Scene Understanding

This section complements the objective evaluation by studying the utility of dehazing preprocessing for human understanding of foggy scenes, and shows that the comparative results of the objective evaluation largely agree with those of the human evaluation.

User Study via Amazon Mechanical Turk

background:

Amazon Mechanical Turk (AMT) + Human Intelligence Task (HIT) + Known Answer Review Policy

procedure:

  1. The paired comparisons technique is used to turn the four-way choice into two-way choices, so each scene requires six comparisons (C(4,2) = 6); subjects are asked to

    choose the one which is more suitable for safe driving (i.e. easier to interpret).

  2. Each Human Intelligence Task (HIT) contains five image pairs, of which three are the pairs actually under comparison and two are known-answer pairs (built from Foggy Cityscapes-refined at two fog densities, where choosing the denser image over the lighter one counts as a wrong answer);

  3. The pairs are randomly shuffled and the left-right order of the images in each pair is randomly swapped.
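A minimal sketch of how such a comparison list could be generated per scene (illustrative; not the authors' AMT setup code):

    % Enumerate all pairwise comparisons among the four dehazing options.
    options = {'no dehazing', 'MSCNN', 'DCP', 'NLD'};
    pairs = nchoosek(1:numel(options), 2);      % C(4,2) = 6 pairs per scene
    % Randomly shuffle the pairs and randomly swap the left-right order.
    pairs = pairs(randperm(size(pairs, 1)), :);
    swap = rand(size(pairs, 1), 1) > 0.5;
    pairs(swap, :) = pairs(swap, [2 1]);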

conclusion:

The known-answer review demonstrates that the subjects have done a decent job: for 83% of the HITs, both known-answer questions are answered correctly.

These 83% of relatively trustworthy HITs are used for the subjective results that follow.

Consistency of Subjects’ Answers

conclusion:

no single option has a dominant advantage over another

Ranking and Correlation with Objective Evaluation

conclusion:

  1. No dehazing clearly beats every dehazing option;

  2. The no-dehazing and DCP options are ranked higher than MSCNN and nonlocal dehazing both in the user study and in the objective evaluation.

  3. The positive correlation of the τ distribution on Foggy Driving supports the conclusion that dehazing preprocessing is beneficial for foggy scene understanding.

Object Detection

Fast R-CNN is used as the evaluation model.

We prefer Fast R-CNN over more recent approaches such as Faster R-CNN [53] because the former involves a simpler training pipeline, making fine-tuning to foggy conditions straightforward.

pipeline:

  1. Take the authors' model pretrained on PASCAL VOC 2007 and fine-tune it on the 3475 (2975 + 500) images of the Cityscapes training and validation sets, as the baseline: W/o FT;

  2. Fine-tune the initial model on the 550 (498 + 52) images of the Foggy Cityscapes-refined training and validation sets to obtain the comparison models FT β = 0.01 and FT β = 0.005 (fine-tuned with β = 0.01 and β = 0.005, respectively).

conclusion:

The overall winner is the model that has been fine-tuned on light fog, which we refer to as FT-0.005: it outperforms the baseline model by 2.4% on average on the two frequent classes and it is also slightly better when taking all 8 classes into account.

Semi-supervised Learning with Synthetic Fog

extend the learning to a new paradigm which is also able to acquire knowledge from unlabeled pairs of foggy images and clear-weather images.

pipeline:

  1. The 498 high-quality labeled images of Foggy Cityscapes-refined serve as Dl (clear image, foggy image, label); the 20000 unlabeled synthetic foggy images of Foggy Cityscapes-coarse serve as Du (clear image, foggy image);
  2. First learn the image-to-label mapping on Dl, then generate labels for Du, and finally optimize the objective of Equation 9 (a sketch follows this list);
  3. Train RefineNet (instead of DCN) on Cityscapes with the original parameters as the baseline; fine-tune RefineNet with Dl; fine-tune RefineNet with Dl plus the labeled Du;
  4. Evaluate on Foggy Driving.
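From the description in step 2, the objective of Equation 9 plausibly has the shape of a weighted sum of a supervised term over Dl and a pseudo-supervised term over Du (a sketch; the paper's exact formulation and weighting may differ):

$$\min_{\phi}\; \sum_{(\bar{x}_i,\, y_i) \in D_l} \mathcal{L}\big(\phi(\bar{x}_i),\, y_i\big) \;+\; \lambda \sum_{\bar{x}_j \in D_u} \mathcal{L}\big(\phi(\bar{x}_j),\, \hat{y}_j\big)$$

where x̄ denotes a synthetic foggy image, y a human annotation, and ŷ_j the prediction of the clear-weather model on the clear counterpart of x̄_j.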

conclusion:

Fine-tuning RefineNet with Dl plus the labeled Du achieves the best results, demonstrating that

fine-tuning with our synthetic fog can indeed improve the performance of semantic foggy scene understanding.

Summary

  • Over the almost one year the experiments ran, this paper was read and forgotten over and over again, hence this write-up 😂;

  • Experiments matter, experiments matter, experiments matter;

  • The Foggy Cityscapes fog simulation does no special handling of the sky region; the INF values in the sky come from the inaccurate raw disparity estimates (marked as invalid in the disparity map and converted to INF during the conversion to depth). However, the denoising and completion of Step 2 not only fails to fill in these INF sky regions, but actually enlarges them (an effect of the SLIC result?), leaving a relatively intact sky.